20/02/2020
Posted on flickr by BBVAtech in 2012, by Asigra CC BY 2.0
Image by Jennifer Dutcher, datascience@berkeley, source: https://datascience.berkeley.edu/what-is-big-data/
“Big Data is the result of collecting information at its most granular level — it’s what you get when you instrument a system and keep all of the data that your instrumentation is able to gather.”
Jon Bruner (Editor-at-Large, O’Reilly Media)
“Big data is data that contains enough observations to demand unusual handling because of its sheer size, though what is unusual changes over time and varies from one discipline to another.”
Annette Greiner
(Lecturer, UC Berkeley School of Information)
“[…] ‘big data’ will ultimately describe any dataset large enough to necessitate high-level programming skill and statistically defensible methodologies in order to transform the data asset into something of value.”
Reid Bryant
(Data Scientist, Brooks Bell)
‘Big Data Landscape (2019)’, source: http://mattturck.com
Photo by Joe Parks, (CC BY-NC 2.0) source: https://flic.kr/p/e2umhv
Astronomy: SKA Radio Telescope
Image by SKA Organisation, source: https://www.skatelescope.org/multimedia/image
Source: Bollen, Mao, and Zeng (2011)
Source: Ranco et al. (2015)
Preparations
# read dataset into R
economics <- read.csv("../data/economics.csv")
# have a look at the data
head(economics, 2)
##         date   pce    pop psavert uempmed unemploy
## 1 1967-07-01 507.4 198712    12.5     4.5     2944
## 2 1967-08-01 510.5 198911    12.5     4.7     2945
# create a 'large' dataset out of this
# (doubling the number of rows three times: 574 * 2^3 = 4592 rows)
for (i in 1:3) {
  economics <- rbind(economics, economics)
}
dim(economics)
## [1] 4592 6
Compute the real personal consumption expenditures (pce): Divide each value of pce by the deflator 1.05.
# Naïve approach (ignorant of R)
deflator <- 1.05 # define deflator
# iterate through each observation
pce_real <- c()
n_obs <- length(economics$pce)
for (i in 1:n_obs) {
pce_real <- c(pce_real, economics$pce[i]/deflator)
}
# look at the result
head(pce_real, 2)
## [1] 483.2381 486.1905
How long does it take?
# Naïve approach (ignorant of R)
deflator <- 1.05 # define deflator
# iterate through each observation
pce_real <- list()
n_obs <- length(economics$pce)
time_elapsed <-
system.time(
for (i in 1:n_obs) {
pce_real <- c(pce_real, economics$pce[i]/deflator)
})
time_elapsed
##    user  system elapsed 
##   0.119   0.016   0.136
Assuming the algorithm scales linearly (\(O(n)\)), the time needed per additional row of data is:
time_per_row <- time_elapsed[3]/n_obs
time_per_row
##      elapsed 
## 2.961672e-05
If we deal with big data, say 100 million rows, that is
# in seconds
(time_per_row*100^4)
##  elapsed 
## 2961.672
# in minutes
(time_per_row*100^4)/60
##  elapsed 
## 49.36121
# in hours
(time_per_row*100^4)/60^2
##   elapsed 
## 0.8226868
What happens in the background?
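In the background, `c(pce_real, ...)` does not extend the existing vector in place: on every iteration, R allocates a new, longer vector and copies all previously stored elements into it, so the total work grows roughly quadratically in the number of rows. A minimal sketch contrasting the two growth strategies (the function names `grow_with_c` and `grow_prealloc` are illustrative):

```r
# Growing a vector with c(): each iteration allocates a new,
# longer vector and copies all previous elements (O(n^2) total work).
grow_with_c <- function(n) {
  x <- numeric(0)
  for (i in 1:n) {
    x <- c(x, i)  # full copy of x on every iteration
  }
  x
}

# Pre-allocating: each element is written into memory that
# already exists (O(n) total work).
grow_prealloc <- function(n) {
  x <- numeric(n)
  for (i in 1:n) {
    x[i] <- i  # no reallocation needed
  }
  x
}

# Both produce the same result ...
identical(grow_with_c(1000), grow_prealloc(1000))

# ... but the copy-on-grow version slows down disproportionately
# as n increases.
system.time(grow_with_c(50000))["elapsed"]
system.time(grow_prealloc(50000))["elapsed"]
```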
Can we improve this?
Can we improve this?
# Improve memory allocation (still somewhat ignorant of R)
deflator <- 1.05 # define deflator
n_obs <- length(economics$pce)
pce_real <- list()
# allocate memory beforehand
# tell R how long the list will be
length(pce_real) <- n_obs
# iterate through each observation
time_elapsed <-
system.time(
for (i in 1:n_obs) {
pce_real[[i]] <- economics$pce[i]/deflator
})
time_elapsed
##    user  system elapsed 
##   0.008   0.000   0.008
Any improvements?
time_per_row <- time_elapsed[3]/n_obs
time_per_row
##     elapsed 
## 1.74216e-06
# in seconds
(time_per_row*100^4)
## elapsed 
## 174.216
# in minutes
(time_per_row*100^4)/60
## elapsed 
##  2.9036
# in hours
(time_per_row*100^4)/60^2
##    elapsed 
## 0.04839334
This looks much better, but we can do even better…
Can we further improve this?
# Do it 'the R way'
deflator <- 1.05 # define deflator
# Exploit R's vectorization!
time_elapsed <-
system.time(
pce_real <- economics$pce/deflator
)
# same result
head(pce_real, 2)
## [1] 483.2381 486.1905
# but much faster!
time_elapsed
##    user  system elapsed 
##       0       0       0
time_per_row <- time_elapsed[3]/n_obs
In fact, system.time() is not precise enough to capture the time elapsed…
# in seconds (time_per_row*100^4)
## elapsed ## 0
# in minutes (time_per_row*100^4)/60
## elapsed ## 0
# in hours (time_per_row*100^4)/60^2
## elapsed ## 0
Use microbenchmark::microbenchmark() to measure the elapsed time in microseconds (millionths of a second).
library(microbenchmark)
# measure elapsed time in microseconds (avg.)
time_elapsed <-
  summary(microbenchmark(pce_real <- economics$pce/deflator))$mean
# per row (in sec)
time_per_row <- (time_elapsed/n_obs)/10^6
Improvement with vectorization (again, assuming 100 million rows)
# in seconds
(time_per_row*100^4)
## [1] 0.4982415
# in minutes
(time_per_row*100^4)/60
## [1] 0.008304025
# in hours
(time_per_row*100^4)/60^2
## [1] 0.0001384004
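The three approaches can also be compared head to head with microbenchmark(). The sketch below uses a simulated column in place of the economics data (the vector `pce` and the wrapper functions are illustrative); all three return identical values, but differ drastically in speed:

```r
library(microbenchmark)

deflator <- 1.05
pce <- runif(10000, 400, 600)  # simulated pce column (illustrative)
n_obs <- length(pce)

# approach 1: growing the result vector with c()
naive <- function() {
  out <- c()
  for (i in 1:n_obs) out <- c(out, pce[i]/deflator)
  out
}

# approach 2: loop over a pre-allocated result vector
prealloc <- function() {
  out <- numeric(n_obs)
  for (i in 1:n_obs) out[i] <- pce[i]/deflator
  out
}

# approach 3: vectorized division
vectorized <- function() pce/deflator

# all three agree ...
stopifnot(isTRUE(all.equal(naive(), prealloc())),
          isTRUE(all.equal(prealloc(), vectorized())))

# ... but the benchmark shows orders-of-magnitude differences
microbenchmark(naive(), prealloc(), vectorized(), times = 5)
```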
Walkowiak, Simon. 2016. Big Data Analytics with R. Birmingham, UK: Packt Publishing.
Wickham, Hadley. 2019. Advanced R. 2nd ed. Boca Raton, FL: CRC Press.
Bollen, Johan, Huina Mao, and Xiaojun Zeng. 2011. “Twitter Mood Predicts the Stock Market.” Journal of Computational Science 2 (1): 1–8. https://doi.org/10.1016/j.jocs.2010.12.007.
Ranco, Gabriele, Darko Aleksovski, Guido Caldarelli, Miha Grčar, and Igor Mozetič. 2015. “The Effects of Twitter Sentiment on Stock Price Returns.” PLOS ONE 10 (9): 1–21. https://doi.org/10.1371/journal.pone.0138441.